DOCUMENT RESUME

ED 090 313                                TM 003 602

AUTHOR          Nicolich, Mark; And Others
TITLE           Demonstration of Techniques for Optimizing the Use of
                Criterion-Referenced Achievement Tests in Children
                Enrolled in a Day Care Center.
PUB DATE        74
NOTE            12p.; Paper presented at the Annual Meeting of the
                American Educational Research Association (Chicago,
                Illinois, April 1974)

EDRS PRICE      MF-$0.75 HC-$1.50 PLUS POSTAGE
DESCRIPTORS     *Achievement Gains; Achievement Tests; *Criterion
                Referenced Tests; Day Care Programs; Educational
                Diagnosis; Feedback; Instructional Programs; *Item
                Analysis; *Measurement Techniques; Multiple Choice
                Tests; Preschool Children; *Preschool Evaluation;
                Program Evaluation; Statistical Analysis; Summative
                Evaluation



ABSTRACT

This investigation describes (a) use of an item
analysis technique applied to pretest results of Tests of Basic
Experiences (Moss, 1971) with children 3-5 years old enrolled in a
day care center; and (b) development of a new statistical technique
for evaluating change ratings between pre- and posttests. A technique
of obtaining mean change ratings for each item, determining
significance of these ratings, and comparing item categories based on
instructional objectives showed certain means for categories included
in instruction significantly different from categories not included
and certain mean change ratings significantly different from zero for
categories of items included in instruction. (Author)






U.S. DEPARTMENT OF HEALTH,
EDUCATION & WELFARE
NATIONAL INSTITUTE OF
EDUCATION
THIS DOCUMENT HAS BEEN REPRODUCED
EXACTLY AS RECEIVED FROM
THE PERSON OR ORGANIZATION ORIGINATING
IT. POINTS OF VIEW OR OPINIONS
STATED DO NOT NECESSARILY REPRESENT
OFFICIAL NATIONAL INSTITUTE OF
EDUCATION POSITION OR POLICY.



Demonstration of Techniques for Optimizing the Use of Criterion-referenced
Achievement Tests in Children Enrolled in a Day Care Center

Mark Nicolich, Lorraine Nicolich, & Jane Raph
Rutgers University






This investigation describes (a) use of an item analysis technique
applied to pretest results of Tests of Basic Experiences (Moss, 1971) with
children 3-5 years old enrolled in a day care center; and (b) development of
a new statistical technique for evaluating change ratings between pre- and
posttests.

A technique of obtaining mean change ratings for each item, determining
significance of these ratings, and comparing item categories based on
instructional objectives showed certain means for categories included in
instruction significantly different from categories not included and certain
mean change ratings significantly different from zero for categories of items
included in instruction.






DEMONSTRATION OF TECHNIQUES FOR OPTIMIZING THE USE OF CRITERION-REFERENCED
ACHIEVEMENT TESTS IN CHILDREN ENROLLED IN A DAY CARE CENTER

Mark Nicolich, Lorraine Nicolich, Jane Raph
Rutgers University

Objectives 

In developing an evaluation plan useful for both instructional and summative
evaluation of teaching effectiveness at a day care center, a refinement of the
gross nature of current preschool evaluation seemed imperative. An effort was
made (a) to use an item analytic technique to assist teachers in
using children's pretest results as a means of identifying areas of knowledge,
or lack thereof, and as an aid in instructional planning; and (b) to develop a
new statistical technique for evaluating item change ratings on achievement-type
tests, together with suggested applications for preschool evaluation.

Theoretical Framework
With recent emphasis in educational funding on accountability and on
performance-related objectives, the limitations of techniques currently in use
for measuring program effectiveness have become increasingly apparent. Day care
is a relatively new phenomenon in the United States. Problems of accountability
have arisen from lack of agreement regarding the nature of day care programs,
i.e., custodial care or developmental education; generally low reliability of
measurement instruments for use with young children; and questions regarding
validity of the measures for demonstrating growth related to curricular
objectives.



Research on effectiveness of early education experiences in the past has
relied heavily on IQ tests. The inappropriateness of these tests for measuring
program outcomes as well as their use in a multiracial, pluralistic society has
been documented frequently (Hunt, 1961, 1964). The demonstrated fact of large
initial IQ gains during the first year of school entrance and then the leveling
off of such gains is also a shortcoming of traditional IQ measures (Klaus and
Gray, 1968).

With respect to achievement-type tests for young children, if the measuring
instrument is directly related to the instructional program, not only can
individual growth be shown, but scores can also be used to assess program
strengths and limitations as well as to form a basis for instructional planning
for the succeeding year (Sigel, 1973; Stodolsky, in press). One might also add
that, while logically the determination of curricular objectives should clearly
precede selection of evaluative approaches, in fact the many new ventures in
day care have not clearly defined their objectives. Such centers might profit
from an early assessment in the school year of each child's grasp of educationally
relevant concepts. These centers might profit also by observing systematically
the changes each child demonstrates during enrollment in a year's
program.

Techniques 

The techniques to be described here are designed to be used with a multiple
choice, achievement-type test. For maximum benefit, it is desirable to have the
test items grouped, or able to be grouped, into specific curricular areas. The
techniques refer first to an item analysis as a basis for planning, and second
to the analysis of item change ratings as the basic data from which different
evaluation questions may be answered.

It is of note that the statistical principles involved are basic, and
represent the application of first principles to an apparently complex problem.
While the techniques do not consider all possible ramifications, they do point
a way for future application.

Item Analysis for Planning 

An Item Analysis computer program (Nicolich, 1972) has been developed which
allows each teacher to determine the specific strengths and weaknesses of the class
or of subgroups within the class. The program will score the test and print
information on the score distribution, class mean, and standard deviation. The item
analysis section of the program yields the percent correct for each test item,
and the percent of students choosing a given response for each item. The listing
of student responses is presented for three groups: the total class, those
students in the upper 27% of the test grade distribution, and those students
in the lower 27% of the distribution. The teacher can then, from the percent
correct, determine areas of weakness. From the responses the teacher can
determine if there are particular misconceptions or a general lack of knowledge
in poorly answered areas. In addition, data on a particular item may indicate
its appropriateness or inappropriateness for a particular group. That is, if
a particular item has few correct responses, investigation of the most frequently
answered distractor may show poor item construction for the particular student
group under consideration. When tests are constructed for national distribution
it is likely that some items will be inappropriate for a particular group of
preschool children. The assumption of similar educational experience is
frequently not tenable for school-aged children, but the assumption when
applied to preschoolers becomes clearly untenable. Not only is there wide
program variation, but various home influences, such as object uses, verbal
expressions, and other idiosyncrasies, are far more evident in the responses of
these children, who have had only brief and relatively recent experience with
society outside the home. In particular, the upper and lower 27% groupings
can be investigated to determine if these groups have an overall need different
from the total class. In a homogeneous SES group which is on the fringe of the
dominant culture (as in the data to be described here), the performance of the
upper 27% can be used as a rough index of the appropriateness of the items for
the total group in general.
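
The item analysis just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Nicolich (1972) program: the function name `item_analysis`, the data layout (one list of chosen responses per student), and the rounding used to form the 27% groups are all hypothetical.

```python
from collections import Counter


def item_analysis(responses, answer_key, fraction=0.27):
    """Score a multiple choice test and summarize each item.

    responses  -- dict: student -> list of chosen options, one per item (assumed layout)
    answer_key -- list of correct options, one per item
    fraction   -- share of students placed in the upper and lower groups
    """
    # Total score per student: number of items answered correctly.
    scores = {
        student: sum(choice == correct for choice, correct in zip(chosen, answer_key))
        for student, chosen in responses.items()
    }
    # Rank students from highest to lowest score and cut the two tail groups.
    ranked = sorted(scores, key=scores.get, reverse=True)
    size = max(1, round(fraction * len(ranked)))
    groups = {"total": ranked, "upper": ranked[:size], "lower": ranked[-size:]}

    report = {}
    for name, members in groups.items():
        items = []
        for i, correct in enumerate(answer_key):
            counts = Counter(responses[student][i] for student in members)
            percent = {opt: 100.0 * c / len(members) for opt, c in counts.items()}
            items.append({
                "percent_correct": percent.get(correct, 0.0),
                "percent_by_response": percent,  # shows the most popular distractor
            })
        report[name] = items
    return scores, report
```

Comparing `report["upper"]` with `report["total"]` for an item gives the rough index of item appropriateness mentioned above; a distractor that dominates `percent_by_response` even in the upper group is a candidate for poor item construction.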

The information can then be used to help the teacher to plan the year's
curriculum to best benefit the students. It is not meant to tell the teacher
which particular individual items to teach to allow a large "gain score" for
that year.

A frequent deficiency of preschool achievement tests is a lack of alternate
forms. Thus the pre- and posttests often consist of the same questions. Ameliorating
this deficiency in the testing procedure are the lack of feedback to the students as
to which responses were correct and the relatively long period between testings.
However, until alternate forms are available, it is an important (if obvious)
administrative detail to avoid teaching with particular items in mind.

Analysis of Item Change Ratings

Item change ratings. In conceptualizing the effect of instructional
programs in achievement terms, one could say that the purpose is that each child
add to his knowledge in certain areas. With respect to any element of knowledge
(sampled by a test item) it seems reasonable to credit the item with one point
(+1) if an item that was answered incorrectly on the pretest is answered
correctly on the posttest, to debit one point (-1) if the reverse holds, and to
give the item no credit (0) for the items on which the child exhibits no change
on the two test administrations. From pre- and posttest data an item change
rating (+1, 0, -1) is assigned to each item for each subject. An average change
rating over all subjects in a given class can then be determined for each item.
The mean change rating for each item then becomes the basic data for statistical
analysis. If we keep in mind the caveat of Sigel (1973, p. 108) to avoid
"premature foreclosure (of data analysis) by formal statistical analysis," it
is possible to scan the mean ratings, inspect inconsistencies, and examine
relationships which might lead to the generating of what Sigel terms "more
realistically complex statements of hypotheses." Such informal inspection of
data sets can suggest the type of analysis most appropriate for a given evaluation
goal. Described herein are several approaches to utilizing the item change
ratings, each related to a different evaluation question.
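
As a minimal sketch of the rating scheme above, the following Python computes the +1, 0, -1 rating for each item and child and then the mean change rating per item. The function names and the input layout (a list of correct/incorrect flags per child) are assumptions made for illustration; they are not taken from the Nicolich (1973) program.

```python
def change_ratings(pre_correct, post_correct):
    """Item change ratings for one child.

    +1: wrong on pretest, right on posttest
    -1: right on pretest, wrong on posttest
     0: no change between the two administrations
    """
    return [int(post) - int(pre) for pre, post in zip(pre_correct, post_correct)]


def mean_change_by_item(pre, post):
    """Mean change rating per item over all children in a class.

    pre, post -- dict: child -> list of booleans (item answered correctly),
                 an assumed layout with the same children and items in both.
    """
    children = list(pre)
    ratings = [change_ratings(pre[child], post[child]) for child in children]
    n_items = len(ratings[0])
    return [sum(r[i] for r in ratings) / len(children) for i in range(n_items)]
```

The list returned by `mean_change_by_item` is the "basic data" referred to above: one mean rating per item, ready to be scanned informally or tested.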

Evaluation of significance of item change ratings. To investigate the
statistical significance of item change ratings it will be assumed that random
guessing applies and that each response is equally likely for each item. In
what follows, it is assumed that there are four responses for each item. (If
there are other than four responses, or the responses are not considered equally
likely, the analysis could proceed along similar lines, making the necessary
changes in the probabilities.)

The probability that an item is incorrectly answered on the pretest and
correctly answered on the posttest (both being chosen at random from among four
possible) is $\frac{3}{4} \cdot \frac{1}{4} = \frac{3}{16}$. Thus the probability
of a +1 item rating is $\frac{3}{16}$. Similarly, the probability of a 0 or -1
rating is found to be $\frac{10}{16}$ and $\frac{3}{16}$ respectively. The mean
and variance of the item change rating are 0 and 0.375.

Thus, under the null hypothesis of random choice on pretest and posttest,
the distribution of item change ratings is unimodal and symmetric. A form of
the Camp-Meidell inequality (Rao, 1965) states:

$$P(|x - \mu| \ge \lambda\sigma) \le \frac{4}{9\lambda^2}$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the distribution of $x$ and
$\lambda$ is any positive constant. Setting the bound equal to .10, so that by
symmetry the upper tail is at most .05, gives $\lambda = 2.11$; with
$\sigma = \sqrt{0.375/n}$ for a mean of $n$ ratings, $\lambda\sigma = 1.29/\sqrt{n}$.
For the problem at hand, the rejection region for
the null hypothesis of guessing, versus the alternative hypothesis of learning
taking place, at the .05 level of significance, is mean change ratings larger
than $1.29/\sqrt{n}$, where $n$ is the number of students included in the mean change
rating. This may be restated as: a mean change rating larger than $1.29/\sqrt{n}$
indicates an increase in learning of test items at the .05 level of significance.

If $n$ is large, then the Central Limit Theorem obtains, and the rejection
region for the same hypotheses at the .05 level of significance is mean change
ratings larger than $1.00/\sqrt{n}$.
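
The two constants can be checked numerically. The sketch below assumes equally likely responses and a one-sided .05 test, as in the text; the function names are hypothetical, and treating the Camp-Meidell bound as split evenly between the two tails is an assumption that follows from the stated symmetry.

```python
from fractions import Fraction
from math import sqrt
from statistics import NormalDist


def null_distribution(n_options=4):
    """Probabilities of a +1, 0, -1 rating under random guessing."""
    p = Fraction(1, n_options)      # chance of a correct guess
    up = (1 - p) * p                # wrong on pretest, right on posttest
    down = p * (1 - p)              # right on pretest, wrong on posttest
    return {+1: up, 0: 1 - up - down, -1: down}


def null_variance(n_options=4):
    """Variance of a single item change rating under random guessing."""
    dist = null_distribution(n_options)
    mean = sum(rating * prob for rating, prob in dist.items())
    return sum((rating - mean) ** 2 * prob for rating, prob in dist.items())


def critical_constants(n_options=4, alpha=0.05):
    """Constants c such that the critical value is c / sqrt(n)."""
    sigma = sqrt(float(null_variance(n_options)))
    # Camp-Meidell: P(|x - mu| >= lam * sigma) <= 4 / (9 lam^2).
    # Assumption: by symmetry each tail receives half, so set the bound to 2 * alpha.
    lam = sqrt(4 / (9 * 2 * alpha))
    z = NormalDist().inv_cdf(1 - alpha)   # one-sided normal quantile
    return lam * sigma, z * sigma
```

With four responses this gives a variance of 3/8 and constants of about 1.29 (Camp-Meidell) and 1.01 (normal approximation), in line with the values quoted above.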

To determine if a particular item was learned (in accordance with the
aforementioned schema), the mean change score for that item would be compared
with $1.29/\sqrt{n}$; if the mean change score was larger, the item was learned. If not,
the item was not learned. A computer program (Nicolich, 1973) has been written
to provide the appropriate statistics to test such hypotheses.

In addition to testing individual items, this technique can be used to test
groups of items. The group would be comprised of items which come from a
particular curriculum area. By calculating the mean change score for all items
in the group, for all students, and comparing this with $1.29/\sqrt{kn}$, the learning
in the curriculum area could be tested ($k$ represents the number of items in the
curriculum group considered). Again, for large samples when the Central Limit
Theorem obtains, the critical value would be $1.00/\sqrt{kn}$.
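
The decision rule for a single item or for a curriculum group can be sketched as follows. The function name and its parameters are illustrative only; the constants 1.29 and 1.00 come from the text, while any mean change ratings passed in are the user's own data.

```python
from math import sqrt


def exceeds_critical_value(mean_change, n_students, n_items=1, large_sample=False):
    """Compare a mean change rating with the .05-level critical value.

    mean_change  -- mean item change rating (over students, and over items for a group)
    n_students   -- n, the number of students in the mean
    n_items      -- k, the number of items in the group (1 for a single item)
    large_sample -- use 1.00 (Central Limit Theorem) instead of 1.29 (Camp-Meidell)
    Returns (decision, critical_value).
    """
    constant = 1.00 if large_sample else 1.29
    critical = constant / sqrt(n_items * n_students)
    return mean_change > critical, critical
```

For example, with a class of 17 children a single item needs a mean change rating above about 0.31, whereas a group of ten items needs a mean above about 0.10, since pooling items shrinks the critical value by a factor of the square root of k.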

Evaluation of teacher goals. In addition to testing if groups of similar
curriculum items are learned, it is possible to use the item change ratings to
evaluate teacher instructional goals. The teacher is asked to divide test
measured curriculum categories into one of three areas: to be included in classroom
instruction, definitely not to be part of the instruction, and to be an
ancillary part of the instruction program. Note that individual test items are not
considered, only curriculum categories; this precludes the teacher instructing
specifically for individual test items.

The item change scores are then averaged for each of the three areas. It
is expected that the area included in instruction would have a larger mean item
score than the area not included in instruction. Fisher's randomization technique
using Snedecor's F (Bradley, 1968) would be used for this test. The
curriculum area means could also be tested to determine if they differed
significantly from zero, again using $1.29/\sqrt{kn}$, where $k$ is the number of items
included in the area.

If the mean of the curriculum group of taught items is significantly
greater than for the group of not taught items, then the teacher could conclude
that the goals were met. If not, then the conclusion would be either that the
particular group of taught items was not learned any better than the not
taught items, or that the not taught items were learned in some other setting.
It would be reasonable to then test if the mean of the taught items category
was significantly larger than zero.
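
A randomization test on the F statistic can be approximated by repeatedly shuffling the item mean change ratings among the teacher's categories. The sketch below is a Monte Carlo version written for illustration; it is not the procedure in Bradley (1968) or the authors' program, and the function names, the number of shuffles, and the seed are assumptions.

```python
import random


def f_statistic(groups):
    """One-way analysis of variance F ratio for a list of groups of values."""
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]
    between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    if within == 0:
        return float("inf")
    return (between / (len(groups) - 1)) / (within / (len(values) - len(groups)))


def randomization_test(groups, n_shuffles=5000, seed=0):
    """Approximate p-value: share of random regroupings with F >= observed F."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    observed = f_statistic(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:             # cut the shuffled pool into the original group sizes
            shuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled) >= observed:
            hits += 1
    # Count the observed arrangement itself so the p-value is never zero.
    return observed, (hits + 1) / (n_shuffles + 1)
```

Each inner list would hold the mean change ratings of the items in one category (included, not included, ancillary); a p-value below .05 corresponds to the category means differing at the level used in the text.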




Application In A Day Care Setting 
The following represent some results from a study where the principles
outlined above were applied.
Subjects 

Subjects were enrolled in an inner-city day care facility. The children
were in three classrooms, two preschool classes of children ages 3 and 4 years
(N=31) and one kindergarten class of children age 5 years (N=17). All children
were either black or Spanish-speaking.
Instruments 

The Language and Mathematics subtests of the Tests of Basic Experiences
(TOBE, Moss, 1971) were selected as criterion-referenced measures of the
educational program planned for the children. Cazden (Buros, 1972) concluded
in her review of these tests that the design and conditions of administration
were probably as good as can be obtained for children this age.
Administration

The tests were administered to individual children or in groups of not more
than two or three. Spanish language instructions included in the TOBE manual
were utilized when appropriate by a native Spanish-speaking tester. Tests were
given in the late fall and in the late spring of the same school year.
Criterion-referenced Sources 

Within each subtest of the TOBE, items were designated by the authors of
the TOBE as sampling relatively specific curricular areas, not merely factual
information. For example, positional and contextual meaning are contained in
the Language subtest. Geometrical shapes and measurement are areas tapped in
the Mathematics subtest. A list of the areas of knowledge contained in the
subtests was given to each teacher, who divided the list into three categories:
those areas definitely included in her instructional program; those definitely
not included; and those areas of an ancillary nature.
Results 

In applying these techniques to the performance of children on the TOBE
in the three classrooms, it was found that in one class a Fisher's Randomization
Technique analysis indicated that the means for items in the category included
in instruction, according to the teacher, were significantly different at alpha
= .05 from the means for items in the category not included.

In testing whether any of the three categories of items had means different
from zero, results indicated that for each class on each test, the category of
items designated as definitely included in instruction had mean change ratings
different from zero at alpha of .05. In only one class did the category of
items not included in instruction achieve change different from zero. This
exception was attributed to greater flexibility in that particular class, where
opportunities for independent learning may have been maximized by the teacher.
The category of ancillary items had means different from zero (again at alpha
.05) in half of the subjects tested.

For each group of children in each class on each test there was a clearly
indicated group of items showing significant change which was amenable to
interpretation with relation to the curriculum.




References 

Bradley, J. V. Distribution-free statistical tests. Englewood Cliffs, New
Jersey: Prentice-Hall, 1968.

Cazden, C. Review of Tests of Basic Experiences. In O. Buros (Ed.), The
seventh mental measurements yearbook. Highland Park, New Jersey:
Gryphon Press, 1972.

Hunt, J. McV. Intelligence and experience. New York: Ronald Press, 1961.

Hunt, J. McV. The psychological basis for using preschool enrichment as an
antidote for cultural deprivation. Merrill-Palmer Quarterly, 1964, 10,
209-248.

Klaus, R. A., & Gray, S. The early training project for disadvantaged
children: A report after five years. Monographs of the Society for
Research in Child Development, 1968, 33 (4, Whole No. 120).

Moss, M. H. Examiner's Manual for Tests of Basic Experiences. Monterey,
California: CTB/McGraw-Hill, 1971.

Nicolich, Mark J. An Item Analysis Program for Multiple Choice Examinations.
Available from the author for the cost of reproduction and mailing. (1972).

Nicolich, Mark J. Item Change Analysis Program for Pre-Post Multiple Choice
Examinations. Available from the author for the cost of reproduction and
mailing. (1973).

Rao, C. R. Linear Statistical Inference and its Applications. New York, N. Y.:
Wiley and Sons, 1965.






Sigel, I. E. Where is preschool education going? Assessment in a Pluralistic
Society: Proceedings of the 1972 Invitational Conference on Testing
Problems, 1973. Pp. 99-116.

Stodolsky, S. S. Defining treatment and outcome in early childhood education.
In H. Walberg (Ed.), Rethinking Urban Education. New York: Jossey-Bass,
in press.



